
LCORE-1723: Cross-Encoder Reranking for enhanced RAG#1566

Open
Anxhela21 wants to merge 3 commits into lightspeed-core:main from Anxhela21:anx/vector_compare

Conversation


@Anxhela21 Anxhela21 commented Apr 21, 2026

Description

Cross-Encoder Reranking for Enhanced RAG Quality

Summary

Implemented neural reranking using cross-encoder models to improve RAG chunk relevance scoring. This enhancement replaces simple vector similarity with semantic relevance scoring when combining results from multiple knowledge sources (BYOK + OKP) that use different search types (e.g., hybrid, semantic).

Key Features

  • Cross-encoder model integration using sentence-transformers
  • Configurable model selection (default: cross-encoder/ms-marco-MiniLM-L6-v2)
  • Query-document pair scoring for improved semantic relevance
  • Efficient model caching to avoid reloading between requests

Flexible Configuration

```yaml
reranker:
  enabled: true
  model: "cross-encoder/ms-marco-MiniLM-L6-v2"
  top_k_multiplier: 2.0  # Fetch 2x chunks, rerank, keep top_k
  byok_boost: 1.2        # Boost factor for BYOK sources
  okp_boost: 1.0         # Boost factor for OKP sources
```
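As an illustration, the same settings could be modeled like this (a plain dataclass stand-in; the project's actual configuration uses Pydantic models, and `candidate_pool` is a hypothetical helper, not part of the PR):

```python
from dataclasses import dataclass


@dataclass
class RerankerConfig:
    """Illustrative stand-in mirroring the YAML keys above."""

    enabled: bool = False
    model: str = "cross-encoder/ms-marco-MiniLM-L6-v2"
    top_k_multiplier: float = 2.0  # fetch top_k * multiplier candidates before reranking
    byok_boost: float = 1.2        # boost factor for BYOK sources
    okp_boost: float = 1.0         # boost factor for OKP sources

    def candidate_pool(self, top_k: int) -> int:
        # Size of the oversampled candidate pool to fetch before reranking.
        return int(top_k * self.top_k_multiplier)


cfg = RerankerConfig(enabled=True)
print(cfg.candidate_pool(10))  # 20
```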

Auto-Enable

  • Automatically enables when both BYOK and OKP RAG sources are configured
  • Applies only when the user hasn't explicitly configured reranker settings (see above)

Error Handling

  • Fallback to original vector scores if reranking fails
  • Lazy model loading with error recovery
  • Model caching
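The fallback behavior can be illustrated with a small sketch (the dict-based chunk shape and the helper names are assumptions for illustration; the real code works on RAGChunk objects):

```python
def safe_rerank(chunks, rerank_fn, top_k):
    """Try reranking; on any failure, fall back to original vector scores."""
    try:
        return rerank_fn(chunks)[:top_k]
    except Exception:
        # Fallback: keep the original ordering by vector similarity score.
        ordered = sorted(
            chunks,
            key=lambda c: c.get("score", float("-inf")),
            reverse=True,
        )
        return ordered[:top_k]


def failing(_chunks):
    raise RuntimeError("model not available")


chunks = [{"id": "a", "score": 0.2}, {"id": "b", "score": 0.9}]
print(safe_rerank(chunks, failing, top_k=1))  # [{'id': 'b', 'score': 0.9}]
```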

Technical Implementation

Modified Files

  • src/utils/vector_search.py - Core reranking logic and pipeline integration
  • src/models/config.py - Configuration schema and validation
  • src/configuration.py - Configuration property access
  • src/constants.py - Default boost factors and settings
  • pyproject.toml - Added sentence-transformers dependency

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features
    • Improved RAG relevance with optional cross-encoder chunk reranking.
    • New reranker settings to customize model, top-K, and boost factors.
    • BYOK chunks receive adjustable score boosting for higher relevance.
    • Automatic reranker activation when both BYOK and OKP RAG sources are configured.
    • Robust fallback: retrieval remains stable if the reranker cannot load or fails.


coderabbitai Bot commented Apr 21, 2026

Warning

Rate limit exceeded

@Anxhela21 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 22 minutes and 32 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4a541b95-3282-4eca-ab95-12ea02bfe20c

📥 Commits

Reviewing files that changed from the base of the PR and between d664680 and 9eb794a.

📒 Files selected for processing (4)
  • pyproject.toml
  • src/constants.py
  • src/models/config.py
  • tests/unit/models/config/test_reranker_configuration.py

Walkthrough

Adds a cross-encoder reranking pipeline: new reranker configuration model and AppConfig property, sentence-transformers dependency (duplicated in pyproject), BYOK rerank boost constant, cached CrossEncoder loading, score normalization and BYOK boosting, refactors chunk retrieval and integrates reranking into build_rag_context.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Dependencies**<br>`pyproject.toml` | Inserts `sentence-transformers>=2.0.0` into `[project].dependencies` (appears twice / duplicated in the dependency array). |
| **Configuration Models**<br>`src/models/config.py`, `src/configuration.py` | Adds `RerankerConfiguration` model (`enabled`, `model`, `top_k_multiplier`, `byok_boost`, `okp_boost`, internal `_explicitly_configured`) and a post-validator that auto-enables the reranker when BYOK and OKP RAG are both present unless explicitly configured. Exposes the `AppConfig.reranker` property. |
| **Constants**<br>`src/constants.py` | Adds `BYOK_RAG_RERANK_BOOST = 1.2`. |
| **RAG Pipeline / Vector Search**<br>`src/utils/vector_search.py` | Implements lazy, cached `sentence_transformers.CrossEncoder` loading; reranks candidate (query, chunk.content) pairs; min-max normalizes scores; supports top_k limiting and fallback to original scores on errors; applies BYOK-specific score boosting; refactors Solr/BYOK retrieval to accept per-call limits; fetches larger candidate pools, merges, optionally reranks, and rebuilds referenced documents from the final selected chunks. |
| **Tests**<br>`tests/unit/models/config/test_reranker_configuration.py`, `tests/unit/utils/test_vector_search.py` | Adds unit tests for `RerankerConfiguration` defaults/validation and extensive tests for cross-encoder loading, caching, reranking behavior, normalization, fallback handling, and BYOK boost application. |
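The min-max normalization mentioned in the walkthrough can be sketched in isolation (pure Python; the function name is illustrative, but the edge-case choices mirror the PR's described behavior):

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Scale scores to [0, 1]; identical scores map to 0.5, a single score to 1.0."""
    if len(scores) <= 1:
        return [1.0] * len(scores)
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are identical; no ordering information, assign 0.5 to all.
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


print(min_max_normalize([2.0, -1.0, 0.5]))  # [1.0, 0.0, 0.5]
```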

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Builder as build_rag_context
    participant BYOK as _fetch_byok_chunks
    participant Solr as _fetch_solr_chunks
    participant Merge as MergeCandidates
    participant Reranker as CrossEncoderReranker
    participant Booster as BYOKBooster
    participant Selector as SelectTopK
    participant Context as BuildRAGContext

    Client->>Builder: query, configuration
    Builder->>BYOK: fetch (max_chunks = 2×BYOK_MAX)
    BYOK-->>Builder: byok_chunks
    Builder->>Solr: fetch (max_chunks = 2×BYOK_MAX)
    Solr-->>Builder: solr_chunks
    Builder->>Merge: combine candidates
    Merge-->>Builder: merged_candidates
    alt configuration.reranker.enabled
        Builder->>Reranker: (query, merged_candidates)
        Reranker->>Reranker: predict scores for (query, chunk.content)
        Reranker->>Reranker: min-max normalize -> [0,1]
        Reranker-->>Builder: reranked_chunks
        Builder->>Booster: apply BYOK boost (multiply non-OKP scores)
        Booster-->>Builder: boosted_chunks
        Builder->>Selector: sort & select top BYOK_MAX
    else
        Builder->>Selector: sort by original scores & select top BYOK_MAX
    end
    Selector-->>Builder: final_chunks
    Builder->>Context: build referenced_documents from final_chunks
    Context-->>Client: RAGContext

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main feature: implementing cross-encoder reranking to enhance RAG, which is the primary objective of this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 98.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



  • Anxhela21 changed the title from "cross encoder and reranker for byok and solr" to "Cross-Encoder Reranking for Enhanced RAG Quality" (Apr 27, 2026)
  • Anxhela21 force-pushed the anx/vector_compare branch from e6b6d66 to 880a566 (April 27, 2026 15:42)
  • Anxhela21 marked this pull request as ready for review (April 27, 2026 17:22)
  • Anxhela21 changed the title from "Cross-Encoder Reranking for Enhanced RAG Quality" to "[LCORE-1723] Cross-Encoder Reranking for Enhanced RAG Quality" (Apr 27, 2026)
  • Anxhela21 changed the title from "[LCORE-1723] Cross-Encoder Reranking for Enhanced RAG Quality" to "[LCORE-1723] Cross-Encoder Reranking" (Apr 27, 2026)
  • Anxhela21 changed the title from "[LCORE-1723] Cross-Encoder Reranking" to "[LCORE-1723] Cross-Encoder Reranking for enhanced RAG" (Apr 27, 2026)
  • Anxhela21 force-pushed the anx/vector_compare branch from 0131555 to 82caaf6 (April 27, 2026 17:38)

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Around line 75-79: Remove the duplicate dependency entry for
"sentence-transformers>=2.0.0" in pyproject.toml (the repeated listing with the
same comment), leaving only a single declaration; also check the llslibdev group
requirement (">=5.0.0") and confirm whether the runtime floor should be raised
to ">=5.0.0" to match required features, updating the single remaining
"sentence-transformers" entry accordingly if the code needs 5.0.0+.

In `@src/constants.py`:
- Around line 195-196: Add a Final[float] type annotation to the new constant
BYOK_RAG_RERANK_BOOST so it matches the other constants' immutability checks;
update the declaration for BYOK_RAG_RERANK_BOOST to use Final[float] and ensure
Final is imported from typing at the top of the file if it isn't already.

In `@src/models/config.py`:
- Around line 1828-1866: The current PrivateAttr(_explicitly_configured: bool =
PrivateAttr(default=False)) makes _explicitly_configured always present and the
value-comparison in mark_as_explicitly_configured() misses cases where a user
explicitly provided a field equal to its default; change _explicitly_configured
to a tri-state (e.g. PrivateAttr(default=None)) and rewrite
mark_as_explicitly_configured to use the Pydantic-provided
self.model_fields_set: compute the set of reranker field names you care about
(enabled, model, model_id, provider_id, top_k_multiplier, byok_boost,
okp_boost), check if any of those names are in self.model_fields_set and if so
set self._explicitly_configured = True (leave it None/False otherwise), and
update the consuming code (the hasattr(...) check elsewhere) to test for
_explicitly_configured is not None / True instead of hasattr so the auto-enable
branch can run correctly.
- Around line 1815-1826: The three RerankerConfiguration fields
(top_k_multiplier, byok_boost, okp_boost) are declared but not used; update the
reranking code to read these values from the configuration instead of using
hard-coded constants: replace the hard-coded 2x in the pool_size calculation in
vector_search.py with configuration.reranker.top_k_multiplier when computing
pool_size (or its equivalent variable), change the boost applied in
_apply_byok_rerank_boost to use configuration.reranker.byok_boost instead of
constants.BYOK_RAG_RERANK_BOOST, and ensure OKP chunks receive
configuration.reranker.okp_boost (apply the multiplier where OKP chunks
currently pass through unmodified). Keep existing defaults from the
RerankerConfiguration when reading these fields so behavior is unchanged if the
user omits them.

In `@src/utils/vector_search.py`:
- Around line 28-56: The module-level cache _cross_encoder_models and loader
_get_cross_encoder are not concurrency-safe and can instantiate
CrossEncoder(model_name) multiple times on cold start; fix by serializing loads
for a given model_name (single-flight) — e.g., add a module-level dict of
asyncio.Lock objects (or a per-name asyncio.Event/Future) and acquire the lock
for model_name inside _get_cross_encoder (or convert the loader to a synchronous
function, apply functools.cache, and call it via asyncio.to_thread) so only one
task constructs CrossEncoder(model_name) and others await the result, then store
the loaded model (or None on failure) into _cross_encoder_models before
releasing the lock; ensure any exception handling still sets the cache to None
to avoid repeated retries.
- Around line 722-728: The function build_rag_context currently takes an unused
parameter moderation_decision (annotated with # pylint:
disable=unused-argument); either remove that parameter from the function
signature and delete the pylint disable, and update its three callers (the calls
in the functions at src/app/endpoints/query.py,
src/app/endpoints/streaming_query.py, and src/app/endpoints/responses.py) to
stop passing moderation_decision, or if you intend to use it later, keep the
parameter but replace the pylint disable with a short TODO comment in
build_rag_context explaining the planned future use and why moderation_decision
is intentionally ignored; reference the function name build_rag_context and the
three caller sites when making the change so the signature and call sites remain
consistent.
- Around line 59-134: The async function _rerank_chunks_with_cross_encoder
currently calls the blocking sentence_transformers.CrossEncoder.predict
synchronously (model.predict), which will block the asyncio event loop; change
the call to run in a thread via asyncio.to_thread (e.g., scores = await
asyncio.to_thread(model.predict, pairs)) so inference executes off the event
loop, preserving async responsiveness, and keep the subsequent score processing
logic unchanged; ensure to import asyncio if not already present.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 89ed94e6-8009-41d5-8e4c-3f2892881c37

📥 Commits

Reviewing files that changed from the base of the PR and between 2d0c79a and 82caaf6.

📒 Files selected for processing (5)
  • pyproject.toml
  • src/configuration.py
  • src/constants.py
  • src/models/config.py
  • src/utils/vector_search.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: bandit
  • GitHub Check: build-pr
  • GitHub Check: unit_tests (3.12)
  • GitHub Check: integration_tests (3.12)
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Import FastAPI dependencies with: from fastapi import APIRouter, HTTPException, Request, status, Depends
Import Llama Stack client with: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for parameters and return types in functions
Use typing_extensions.Self for model validators in Pydantic models
Use modern union type syntax str | int instead of Union[str, int]
Use Optional[Type] for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use standard log levels with clear purposes: debug, info, warning, error
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Use @model_validator and @field_validator for Pydantic model validation
Complete type annotations for all class attributes; use specific types, not Any
Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections

Files:

  • src/constants.py
  • src/configuration.py
  • src/models/config.py
  • src/utils/vector_search.py
**/constants.py

📄 CodeRabbit inference engine (AGENTS.md)

Central constants.py for shared constants with descriptive comments

Files:

  • src/constants.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Pydantic models extend ConfigurationBase for config, BaseModel for data models

Files:

  • src/constants.py
  • src/configuration.py
  • src/models/config.py
  • src/utils/vector_search.py
src/**/config*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/config*.py: All config uses Pydantic models extending ConfigurationBase
Base class sets extra="forbid" to reject unknown fields in Pydantic models
Use @field_validator and @model_validator for custom validation in Pydantic models
Use type hints like Optional[FilePath], PositiveInt, SecretStr in Pydantic models

Files:

  • src/configuration.py
  • src/models/config.py
🧠 Learnings (6)
📚 Learning: 2026-03-17T11:34:53.242Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1335
File: docs/config.md:411-411
Timestamp: 2026-03-17T11:34:53.242Z
Learning: In the lightspeed-stack project (`src/models/config.py`, `docs/config.md`), the internal Solr filter `is_chunk:true` (defined as `SOLR_CHUNK_FILTER_QUERY` in `src/constants.py`) is always injected by the system for OKP searches and is intentionally hidden from users. The `chunk_filter_query` field in `OkpConfiguration` is user-facing and additive-only, but the documentation must NOT mention the internal `is_chunk:true` behavior — this is a deliberate design decision by the maintainers.

Applied to files:

  • src/constants.py
  • src/utils/vector_search.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/config*.py : All config uses Pydantic models extending `ConfigurationBase`

Applied to files:

  • src/models/config.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/*.py : Pydantic models extend `ConfigurationBase` for config, `BaseModel` for data models

Applied to files:

  • src/models/config.py
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.

Applied to files:

  • src/models/config.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.

Applied to files:

  • src/models/config.py
📚 Learning: 2026-01-30T13:33:40.479Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1073
File: src/app/endpoints/query_v2.py:575-586
Timestamp: 2026-01-30T13:33:40.479Z
Learning: In `src/app/endpoints/query_v2.py`, the `parse_referenced_documents_from_responses_api` function filters referenced documents to include only those with `doc_title` or `doc_url` because these documents are displayed in the frontend. Documents with only `doc_id` are intentionally excluded as they wouldn't provide useful information to end users.

Applied to files:

  • src/utils/vector_search.py
🔇 Additional comments (3)
src/utils/vector_search.py (1)

722-746: pool_size/top_k ignore the configured top_k_multiplier.

pool_size = 2 * constants.BYOK_RAG_MAX_CHUNKS hard-codes the 2× factor, but RerankerConfiguration.top_k_multiplier (default 2.0) is intended to govern this exact value. Likewise, _apply_byok_rerank_boost(...) reads the multiplier from constants.BYOK_RAG_RERANK_BOOST, ignoring reranker.byok_boost and reranker.okp_boost. See the comment on src/models/config.py RerankerConfiguration for the full picture and a suggested wiring diff.
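The wiring this comment suggests could be sketched roughly as follows (`RerankerCfg`, `pool_size`, and `apply_source_boost` are illustrative stand-ins, not the project's actual names):

```python
import math
from dataclasses import dataclass


@dataclass
class RerankerCfg:
    top_k_multiplier: float = 2.0
    byok_boost: float = 1.2
    okp_boost: float = 1.0


def pool_size(top_k: int, cfg: RerankerCfg) -> int:
    # Fetch a larger candidate pool before reranking, sized by the
    # configured multiplier rather than a hard-coded 2x factor.
    return math.ceil(top_k * cfg.top_k_multiplier)


def apply_source_boost(score: float, source: str, cfg: RerankerCfg) -> float:
    # Multiply the normalized score by the configured per-source boost factor.
    return score * (cfg.byok_boost if source == "byok" else cfg.okp_boost)


cfg = RerankerCfg()
print(pool_size(10, cfg))                    # 20
print(apply_source_boost(0.5, "byok", cfg))  # 0.6
```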

src/configuration.py (1)

469-475: LGTM.

The new reranker property follows the same pattern as the surrounding rag/okp properties. Since reranker is a non-optional field with a default_factory in Configuration, the access is safe once _configuration is loaded.

src/models/config.py (1)

1822-1823: Remove unused placeholder fields or wire them up.

model_id and provider_id in RerankerConfiguration are not consumed by src/utils/vector_search.py (only model and enabled are read). If these fields are reserved for future Llama Stack-backed reranker provider support, they should either be wired up or removed to avoid leaking unimplemented schema to users. The Pydantic v2 protected namespace concern does not apply here—the project uses Pydantic v2.10.6+, where the default protected_namespaces setting only flags direct conflicts with methods like model_validate and model_dump, not fields with the model_ prefix.

> Likely an incorrect or invalid review comment.

Collapsed comment threads:
  • pyproject.toml (outdated)
  • src/constants.py (outdated)
  • src/models/config.py (outdated)
  • src/models/config.py
Comment on lines +28 to +56
```python
# Lazy-loaded cross-encoder models for reranking RAG chunks (CPU-bound, use in thread).
# Cache models by name to avoid reloading the same model multiple times.
# Not a constant; pylint invalid-name is disabled for this module-level singleton.
_cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name


def _get_cross_encoder(model_name: str) -> Any:
    """Return the lazy-loaded cross-encoder model for reranking.

    Args:
        model_name: Name of the cross-encoder model to load.

    Returns:
        Loaded CrossEncoder model instance, or None if loading fails.
    """
    if model_name not in _cross_encoder_models:
        try:
            from sentence_transformers import (  # pylint: disable=import-outside-toplevel
                CrossEncoder,
            )

            _cross_encoder_models[model_name] = CrossEncoder(model_name)
            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
        except Exception as e:  # pylint: disable=broad-exception-caught
            logger.warning(
                "Could not load cross-encoder for reranking (%s): %s", model_name, e
            )
            _cross_encoder_models[model_name] = None
    return _cross_encoder_models[model_name]
```

@coderabbitai coderabbitai Bot Apr 27, 2026


⚠️ Potential issue | 🟠 Major

Cross-encoder cache is not safe against concurrent cold-start loads.

_cross_encoder_models is a plain module-level dict. With multiple concurrent requests arriving while the cache is cold (e.g. right after process start), each await-suspended task can re-enter _get_cross_encoder before any prior task has populated the cache, causing CrossEncoder(model_name) to be instantiated several times — wasteful CPU/RAM on startup and possible OOM if the model is large.

A single-flight pattern with asyncio.Lock (or functools.cache if you make the loader sync and only call it via asyncio.to_thread) avoids this:

♻️ Single-flight loader sketch
```diff
-_cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name
+_cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name
+_cross_encoder_load_lock = asyncio.Lock()
+
+
+async def _get_cross_encoder(model_name: str) -> Any:
+    if model_name in _cross_encoder_models:
+        return _cross_encoder_models[model_name]
+    async with _cross_encoder_load_lock:
+        if model_name in _cross_encoder_models:
+            return _cross_encoder_models[model_name]
+        try:
+            from sentence_transformers import CrossEncoder  # pylint: disable=import-outside-toplevel
+            model = await asyncio.to_thread(CrossEncoder, model_name)
+            _cross_encoder_models[model_name] = model
+            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
+        except Exception as e:  # pylint: disable=broad-exception-caught
+            logger.warning("Could not load cross-encoder (%s): %s", model_name, e)
+            _cross_encoder_models[model_name] = None
+    return _cross_encoder_models[model_name]
```

Note: when running uvicorn with workers > 1, each process still gets its own copy of the model — that is unavoidable here, but worth keeping in mind for memory budgeting.



@syedriko or @tisnik do you think this is a valid concern?


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +59 to +134
```python
async def _rerank_chunks_with_cross_encoder(
    query: str,
    chunks: list[RAGChunk],
    top_k: int,
    model_name: str = "cross-encoder/ms-marco-MiniLM-L6-v2",
) -> list[RAGChunk]:
    """Rerank chunks using configurable cross-encoder model.

    Args:
        query: The search query
        chunks: RAG chunks to rerank
        top_k: Number of top chunks to return
        model_name: Cross-encoder model name to use

    Returns:
        Top top_k chunks sorted by cross-encoder score (descending)
    """
    if not chunks:
        return []

    try:
        # Get the cached cross-encoder model
        model = _get_cross_encoder(model_name)
        if model is None:
            raise RuntimeError(f"Failed to load cross-encoder model: {model_name}")

        logger.debug("Using cross-encoder model: %s", model_name)

        # Create query-chunk pairs for scoring
        pairs = [(query, chunk.content) for chunk in chunks]
        scores = model.predict(pairs)

        if hasattr(scores, "tolist"):
            scores = scores.tolist()

        # Normalize scores to [0,1] range using min-max normalization
        if len(scores) > 1:
            min_score = min(scores)
            max_score = max(scores)
            score_range = max_score - min_score
            if score_range > 0:
                normalized_scores = [
                    (score - min_score) / score_range for score in scores
                ]
            else:
                # All scores are identical, assign 0.5 to all
                normalized_scores = [0.5] * len(scores)
        else:
            # Single score, assign 1.0
            normalized_scores = [1.0] * len(scores)

        # Combine normalized scores with chunks and sort by score (descending)
        indexed = list(zip(normalized_scores, chunks, strict=True))
        indexed.sort(key=lambda x: x[0], reverse=True)
        top_indexed = indexed[:top_k]

        # Return RAGChunk list with normalized cross-encoder scores [0,1]
        return [
            RAGChunk(
                content=chunk.content,
                source=chunk.source,
                score=float(score),
                attributes=chunk.attributes,
            )
            for score, chunk in top_indexed
        ]

    except Exception as e:  # pylint: disable=broad-exception-caught
        logger.warning(
            "Cross-encoder reranking failed, falling back to original scoring: %s", e
        )
        # Fallback: sort by original score and take top_k
        sorted_chunks = sorted(
            chunks,
            key=lambda c: c.score if c.score is not None else float("-inf"),
            reverse=True,
        )
        return sorted_chunks[:top_k]
```

@coderabbitai coderabbitai Bot Apr 27, 2026


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

sentence-transformers CrossEncoder predict synchronous blocking

💡 Result:

The CrossEncoder.predict method in sentence-transformers performs synchronous, blocking inference by default. It computes similarity scores for input sentence pairs in a blocking manner on the specified device(s), returning results only after completion. Key details:

  • Input: a list of sentence pairs, e.g., [['sentence1a', 'sentence1b'], ['sentence2a', 'sentence2b']].
  • Default behavior: batched processing (batch_size=32), optional progress bar; returns a numpy array or torch.Tensor of scores.
  • Blocking: no asynchronous option mentioned; the call waits for computation to finish.
  • Multi-device: supports multiprocessing via a device list (e.g., ["cuda:0", "cuda:1"]) or a manual pool for parallel GPUs/CPUs, but remains synchronous from the caller's perspective.
  • Example:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = model.predict([('Query1', 'Doc1'), ('Query1', 'Doc2')])
# Returns an array of scores after computation completes.
```

This is confirmed in official documentation as the standard synchronous prediction workflow.


🏁 Scripts executed (verification against the lightspeed-core/lightspeed-stack repository):

```shell
# Locate the file and confirm line numbers and context
fd -t f "vector_search.py" | head -5
# Inspect the reranking function and the _get_cross_encoder implementation
cat -n src/utils/vector_search.py | sed -n '59,100p'
cat -n src/utils/vector_search.py | sed -n '34,56p'
# Check the Python version requirement
cat pyproject.toml | grep -A 5 "python"
[ -f setup.py ] && cat setup.py | head -30 || echo "No setup.py found"
# Verify the "CPU-bound, use in thread" comment and existing imports
cat -n src/utils/vector_search.py | sed -n '1,35p'
head -30 src/utils/vector_search.py | grep -E "^import|^from"
```


model.predict() is synchronous and blocks the event loop inside an async function.

sentence_transformers.CrossEncoder.predict is a CPU-bound PyTorch call that runs synchronously. Calling it directly at line 89 inside async def _rerank_chunks_with_cross_encoder blocks the asyncio event loop for the entire inference duration — preventing the worker from handling other concurrent requests. The module-level comment at line 28 explicitly notes "CPU-bound, use in thread", yet the code does not implement this.

Use asyncio.to_thread (available in Python 3.9+, and your project requires ≥3.12) to offload the blocking inference so the event loop remains responsive.

Proposed fix
         # Create query-chunk pairs for scoring
         pairs = [(query, chunk.content) for chunk in chunks]
-        scores = model.predict(pairs)
+        scores = await asyncio.to_thread(model.predict, pairs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 59 - 134, The async function
_rerank_chunks_with_cross_encoder currently calls the blocking
sentence_transformers.CrossEncoder.predict synchronously (model.predict), which
will block the asyncio event loop; change the call to run in a thread via
asyncio.to_thread (e.g., scores = await asyncio.to_thread(model.predict, pairs))
so inference executes off the event loop, preserving async responsiveness, and
keep the subsequent score processing logic unchanged; ensure to import asyncio
if not already present.

Contributor
This seems a valid concern

Contributor
@are-ces, agreed — this is a valid issue. The fix is straightforward since asyncio is already imported. Here's the minimal change:

         # Create query-chunk pairs for scoring
         pairs = [(query, chunk.content) for chunk in chunks]
-        scores = model.predict(pairs)
+        scores = await asyncio.to_thread(model.predict, pairs)

As a secondary note, there's also a cold-start blocking concern: on the first call for a given model_name, _get_cross_encoder calls CrossEncoder(model_name) synchronously (model download + load), which also blocks the event loop. You may want to wrap that call in asyncio.to_thread as well:

-        model = _get_cross_encoder(model_name)
+        model = await asyncio.to_thread(_get_cross_encoder, model_name)

The module-level comment even acknowledges this with # CPU-bound, use in thread, so both fixes align with the documented intent.

Comment on lines +722 to 728
```diff
 async def build_rag_context(  # pylint: disable=too-many-locals
     client: AsyncLlamaStackClient,
-    moderation_decision: str,
+    moderation_decision: str,  # pylint: disable=unused-argument
     query: str,
     vector_store_ids: Optional[list[str]],
     solr: Optional[SolrVectorSearchRequest] = None,
 ) -> RAGContext:
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed (verification against the lightspeed-core/lightspeed-stack repository):

```shell
#!/bin/bash
# Find all call sites of build_rag_context to confirm impact of removing the parameter
rg -nP --type=py -C2 '\bbuild_rag_context\s*\('
# Read the build_rag_context implementation
sed -n '722,800p' src/utils/vector_search.py
# Check the exact context at each call site
sed -n '170,185p' src/app/endpoints/query.py
sed -n '228,245p' src/app/endpoints/streaming_query.py
sed -n '343,360p' src/app/endpoints/responses.py
# Verify that moderation_decision is not used anywhere in the function body
rg -A30 'async def build_rag_context' src/utils/vector_search.py | grep -i 'moderation_decision'
```


Remove unused moderation_decision parameter or document its intended future use.

The parameter is annotated with # pylint: disable=unused-argument and is not referenced anywhere in the function body (nor in the docstring). The function builds RAG context without consulting moderation status, since moderation checks are already performed by callers before deciding whether to proceed. Remove the parameter and update the three call sites:

  • src/app/endpoints/query.py:179
  • src/app/endpoints/streaming_query.py:234
  • src/app/endpoints/responses.py:349

Alternatively, if there is planned future use, add a TODO comment explaining the intent instead of silently passing the query to vector stores for queries that may have been blocked.

Suggested signature cleanup
 async def build_rag_context(  # pylint: disable=too-many-locals
     client: AsyncLlamaStackClient,
-    moderation_decision: str,  # pylint: disable=unused-argument
     query: str,
     vector_store_ids: Optional[list[str]],
     solr: Optional[SolrVectorSearchRequest] = None,
 ) -> RAGContext:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 722 - 728, The function
build_rag_context currently takes an unused parameter moderation_decision
(annotated with # pylint: disable=unused-argument); either remove that parameter
from the function signature and delete the pylint disable, and update its three
callers (the calls in the functions at src/app/endpoints/query.py,
src/app/endpoints/streaming_query.py, and src/app/endpoints/responses.py) to
stop passing moderation_decision, or if you intend to use it later, keep the
parameter but replace the pylint disable with a short TODO comment in
build_rag_context explaining the planned future use and why moderation_decision
is intentionally ignored; reference the function name build_rag_context and the
three caller sites when making the change so the signature and call sites remain
consistent.

@Anxhela21 Anxhela21 changed the title [LCORE-1723] Cross-Encoder Reranking for enhanced RAG LCORE-1723: Cross-Encoder Reranking for enhanced RAG Apr 27, 2026
@Anxhela21 Anxhela21 force-pushed the anx/vector_compare branch from cce6b82 to d664680 Compare April 27, 2026 18:56
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/utils/vector_search.py (1)

642-685: ⚠️ Potential issue | 🟠 Major

The OKP fetch path never actually expands the rerank candidate pool.

_fetch_solr_rag() accepts max_chunks, but the value never reaches vector_io.query() because _build_query_params(solr) still uses the default k. build_rag_context() also calls _fetch_solr_rag() without max_chunks=pool_size. As a result, OKP stays capped at the default retrieval size, so the reranker never sees the larger Solr pool described in this PR.

♻️ Suggested fix
-            params = _build_query_params(solr)
+            params = _build_query_params(solr, k=limit)
-    solr_chunks_task = _fetch_solr_rag(client, query, solr)
+    solr_chunks_task = _fetch_solr_rag(
+        client, query, solr, max_chunks=pool_size
+    )

Also applies to: 748-755

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 642 - 685, The Solr fetch never
expands the rerank candidate pool because _fetch_solr_rag ignores its max_chunks
arg and _build_query_params still uses the default k; also build_rag_context
doesn't pass pool_size into _fetch_solr_rag. Fix by threading max_chunks
through: update _fetch_solr_rag to pass max_chunks (or computed limit) into the
params returned by _build_query_params (ensure _build_query_params accepts a
k/limit argument and sets params['k'] accordingly), call client.vector_io.query
with those params, and modify build_rag_context to call _fetch_solr_rag(...,
max_chunks=pool_size) so the Solr query and vector_io.query receive the larger
pool_size for reranking (refer to functions _fetch_solr_rag,
_build_query_params, build_rag_context, vector_io.query, and variables
max_chunks/pool_size).
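The pool-size threading described above ties back to the PR's `top_k_multiplier: 2.0` setting ("fetch 2x chunks, rerank, keep top_k"). A sketch of the arithmetic (`rerank_pool_size` is a hypothetical helper; whether the PR rounds up or truncates fractional pools is not shown in this excerpt):

```python
import math


def rerank_pool_size(top_k: int, multiplier: float = 2.0) -> int:
    """Candidate pool to fetch before reranking back down to top_k."""
    return max(top_k, math.ceil(top_k * multiplier))


print(rerank_pool_size(5))       # default 2.0 multiplier -> 10
print(rerank_pool_size(5, 1.5))  # fractional pools round up -> 8
```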
♻️ Duplicate comments (3)
src/utils/vector_search.py (2)

79-90: ⚠️ Potential issue | 🔴 Critical

Offload CrossEncoder.predict() off the event loop.

_rerank_chunks_with_cross_encoder() is async, but model.predict() is a synchronous inference call. Awaiting this path under load will stall the event loop for the full rerank duration and block unrelated requests.

♻️ Minimal fix
-        model = _get_cross_encoder(model_name)
+        model = await _get_cross_encoder(model_name)
@@
-        scores = model.predict(pairs)
+        scores = await asyncio.to_thread(model.predict, pairs)
According to the official sentence-transformers documentation, is `CrossEncoder.predict(...)` a synchronous/blocking call, and is `asyncio.to_thread(...)` the appropriate way to invoke it from Python async code?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 79 - 90,
_rerank_chunks_with_cross_encoder is async but calls the synchronous
CrossEncoder.predict (via model.predict) which blocks the event loop; change the
call to run the blocking inference off the loop using asyncio.to_thread (or
loop.run_in_executor) so prediction runs in a worker thread and returns
awaitably. Locate the call to model.predict in _rerank_chunks_with_cross_encoder
and replace the direct call with an awaited asyncio.to_thread(lambda:
model.predict(pairs)) (or equivalent) so the CrossEncoder.predict invocation is
executed off the event loop.

34-56: ⚠️ Potential issue | 🟠 Major

Serialize cold-start loads for each cross-encoder model.

This cache is still racy: concurrent requests can all miss _cross_encoder_models and instantiate the same model before the first load stores it, causing an expensive CPU/RAM spike for large rerankers. Wrap model construction in a per-model single-flight lock so only one task loads each model name.

♻️ Suggested single-flight pattern
 _cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name
+_cross_encoder_load_locks: dict[str, asyncio.Lock] = {}  # pylint: disable=invalid-name
 
 
-def _get_cross_encoder(model_name: str) -> Any:
+async def _get_cross_encoder(model_name: str) -> Any:
     """Return the lazy-loaded cross-encoder model for reranking.
@@
-    if model_name not in _cross_encoder_models:
-        try:
-            from sentence_transformers import (  # pylint: disable=import-outside-toplevel
-                CrossEncoder,
-            )
-
-            _cross_encoder_models[model_name] = CrossEncoder(model_name)
-            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
-        except Exception as e:  # pylint: disable=broad-exception-caught
-            logger.warning(
-                "Could not load cross-encoder for reranking (%s): %s", model_name, e
-            )
-            _cross_encoder_models[model_name] = None
+    if model_name in _cross_encoder_models:
+        return _cross_encoder_models[model_name]
+
+    lock = _cross_encoder_load_locks.setdefault(model_name, asyncio.Lock())
+    async with lock:
+        if model_name in _cross_encoder_models:
+            return _cross_encoder_models[model_name]
+        try:
+            from sentence_transformers import (  # pylint: disable=import-outside-toplevel
+                CrossEncoder,
+            )
+
+            _cross_encoder_models[model_name] = await asyncio.to_thread(
+                CrossEncoder, model_name
+            )
+            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
+        except Exception as e:  # pylint: disable=broad-exception-caught
+            logger.warning(
+                "Could not load cross-encoder for reranking (%s): %s", model_name, e
+            )
+            _cross_encoder_models[model_name] = None
     return _cross_encoder_models[model_name]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 34 - 56, The _get_cross_encoder
function is racy because multiple callers can simultaneously miss
_cross_encoder_models and each instantiate the same heavy model; fix it by
introducing a per-model single-flight lock (e.g., a module-level dict
_cross_encoder_locks: Dict[str, threading.Lock]) and wrap the check-and-create
logic for a given model_name inside that lock: obtain or create the lock for
model_name (use setdefault), acquire it (with context manager) before
re-checking _cross_encoder_models[model_name], then instantiate CrossEncoder and
assign to _cross_encoder_models[model_name] inside the locked region (ensure
exceptions still set the cache to None and release the lock in finally). Update
references to _get_cross_encoder, _cross_encoder_models, and logger accordingly.
src/models/config.py (1)

1829-1861: ⚠️ Potential issue | 🟠 Major

_explicitly_configured never makes the auto-enable branch work as intended.

PrivateAttr(default=False) means _explicitly_configured is always present, so not hasattr(self.reranker, "_explicitly_configured") is always false and the auto-enable path never runs. The value-comparison validator also misses explicit input that matches defaults, e.g. reranker: { enabled: true }. Use model_fields_set to mark explicit input and test the flag value directly.

♻️ Suggested fix
 class RerankerConfiguration(ConfigurationBase):
@@
     _explicitly_configured: bool = PrivateAttr(default=False)
@@
    @model_validator(mode="after")
     def mark_as_explicitly_configured(self) -> Self:
         """Mark this configuration as explicitly set when instantiated from user input."""
-        # Only mark as explicitly configured if we're not using all default values
-        # This allows auto-enabling when user hasn't touched reranker settings
-        # Check if any field differs from default values
-        default_model = "cross-encoder/ms-marco-MiniLM-L6-v2"
-        default_top_k_multiplier = 2.0
-        default_byok_boost = 1.2
-        default_okp_boost = 1.0
-
-        # Check if any setting differs from defaults (indicates explicit configuration)
-        current_values = [
-            self.enabled,
-            self.model,
-            self.top_k_multiplier,
-            self.byok_boost,
-            self.okp_boost,
-        ]
-        default_values = [
-            True,
-            default_model,
-            default_top_k_multiplier,
-            default_byok_boost,
-            default_okp_boost,
-        ]
-
-        if current_values != default_values:
+        if self.model_fields_set:
             self._explicitly_configured = True
         return self
@@
-        if (
-            has_byok
-            and has_okp
-            and not hasattr(self.reranker, "_explicitly_configured")
-        ):
+        if has_byok and has_okp and not self.reranker._explicitly_configured:
             # pylint: disable=no-member
             if not self.reranker.enabled:
                 logger.info(
                     "Automatically enabling reranker: Both BYOK RAG (%d entries) and OKP "
                     "are configured. Reranking improves result quality when multiple "
In Pydantic v2, do `PrivateAttr(default=...)` attributes always exist on model instances, and does `model_fields_set` track fields explicitly provided in the input even when the provided value equals the field default?

Also applies to: 2154-2168

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/config.py` around lines 1829 - 1861, The _explicitly_configured
flag is never meaningful because PrivateAttr always exists; update the
mark_as_explicitly_configured validator (and the similar validator at lines
~2154-2168) to use model_fields_set to detect which fields were explicitly
provided and set self._explicitly_configured = True when any reranker-related
fields (e.g., "enabled", "model", "top_k_multiplier", "byok_boost", "okp_boost")
appear in self.model_fields_set or when the parent model passed reranker in its
input; also change any checks that use hasattr(self.reranker,
"_explicitly_configured") to inspect the flag's value directly
(self.reranker._explicitly_configured) so the auto-enable branch can run when
the user explicitly provided values even if they match defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/utils/vector_search.py`:
- Around line 182-205: The loop that builds ReferencedDocument from rag_chunks
currently appends entries even when both title and URL are missing; update the
logic in the loop (the section that computes attrs, doc_url, parsed_url and
before result.append) to skip/continue when both attrs.get("title") is falsy and
parsed_url is falsy/None so you only append ReferencedDocument when there is a
doc_title or a doc_url; keep existing dedup_key/seen handling and AnyUrl parsing
but add a guard (using the computed title and parsed_url) to avoid creating
empty ReferencedDocument objects.

---

Outside diff comments:
In `@src/utils/vector_search.py`:
- Around line 642-685: The Solr fetch never expands the rerank candidate pool
because _fetch_solr_rag ignores its max_chunks arg and _build_query_params still
uses the default k; also build_rag_context doesn't pass pool_size into
_fetch_solr_rag. Fix by threading max_chunks through: update _fetch_solr_rag to
pass max_chunks (or computed limit) into the params returned by
_build_query_params (ensure _build_query_params accepts a k/limit argument and
sets params['k'] accordingly), call client.vector_io.query with those params,
and modify build_rag_context to call _fetch_solr_rag(..., max_chunks=pool_size)
so the Solr query and vector_io.query receive the larger pool_size for reranking
(refer to functions _fetch_solr_rag, _build_query_params, build_rag_context,
vector_io.query, and variables max_chunks/pool_size).

---

Duplicate comments:
In `@src/models/config.py`:
- Around line 1829-1861: The _explicitly_configured flag is never meaningful
because PrivateAttr always exists; update the mark_as_explicitly_configured
validator (and the similar validator at lines ~2154-2168) to use
model_fields_set to detect which fields were explicitly provided and set
self._explicitly_configured = True when any reranker-related fields (e.g.,
"enabled", "model", "top_k_multiplier", "byok_boost", "okp_boost") appear in
self.model_fields_set or when the parent model passed reranker in its input;
also change any checks that use hasattr(self.reranker, "_explicitly_configured")
to inspect the flag's value directly (self.reranker._explicitly_configured) so
the auto-enable branch can run when the user explicitly provided values even if
they match defaults.

In `@src/utils/vector_search.py`:
- Around line 79-90: _rerank_chunks_with_cross_encoder is async but calls the
synchronous CrossEncoder.predict (via model.predict) which blocks the event
loop; change the call to run the blocking inference off the loop using
asyncio.to_thread (or loop.run_in_executor) so prediction runs in a worker
thread and returns awaitably. Locate the call to model.predict in
_rerank_chunks_with_cross_encoder and replace the direct call with an awaited
asyncio.to_thread(lambda: model.predict(pairs)) (or equivalent) so the
CrossEncoder.predict invocation is executed off the event loop.
- Around line 34-56: The _get_cross_encoder function is racy because multiple
callers can simultaneously miss _cross_encoder_models and each instantiate the
same heavy model; fix it by introducing a per-model single-flight lock (e.g., a
module-level dict _cross_encoder_locks: Dict[str, threading.Lock]) and wrap the
check-and-create logic for a given model_name inside that lock: obtain or create
the lock for model_name (use setdefault), acquire it (with context manager)
before re-checking _cross_encoder_models[model_name], then instantiate
CrossEncoder and assign to _cross_encoder_models[model_name] inside the locked
region (ensure exceptions still set the cache to None and release the lock in
finally). Update references to _get_cross_encoder, _cross_encoder_models, and
logger accordingly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 32b1995a-8892-4d14-a537-53427a6d3620

📥 Commits

Reviewing files that changed from the base of the PR and between 82caaf6 and d664680.

📒 Files selected for processing (7)
  • pyproject.toml
  • src/configuration.py
  • src/constants.py
  • src/models/config.py
  • src/utils/vector_search.py
  • tests/unit/models/config/test_reranker_configuration.py
  • tests/unit/utils/test_vector_search.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: build-pr
  • GitHub Check: unit_tests (3.13)
  • GitHub Check: unit_tests (3.12)
  • GitHub Check: integration_tests (3.12)
  • GitHub Check: integration_tests (3.13)
  • GitHub Check: Pylinter
  • GitHub Check: mypy
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Import FastAPI dependencies with: from fastapi import APIRouter, HTTPException, Request, status, Depends
Import Llama Stack client with: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for parameters and return types in functions
Use typing_extensions.Self for model validators in Pydantic models
Use modern union type syntax str | int instead of Union[str, int]
Use Optional[Type] for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use standard log levels with clear purposes: debug, info, warning, error
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Use @model_validator and @field_validator for Pydantic model validation
Complete type annotations for all class attributes; use specific types, not Any
Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections

Files:

  • src/constants.py
  • src/configuration.py
  • tests/unit/models/config/test_reranker_configuration.py
  • src/models/config.py
  • tests/unit/utils/test_vector_search.py
  • src/utils/vector_search.py
**/constants.py

📄 CodeRabbit inference engine (AGENTS.md)

Central constants.py for shared constants with descriptive comments

Files:

  • src/constants.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Pydantic models extend ConfigurationBase for config, BaseModel for data models

Files:

  • src/constants.py
  • src/configuration.py
  • src/models/config.py
  • src/utils/vector_search.py
src/**/config*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/config*.py: All config uses Pydantic models extending ConfigurationBase
Base class sets extra="forbid" to reject unknown fields in Pydantic models
Use @field_validator and @model_validator for custom validation in Pydantic models
Use type hints like Optional[FilePath], PositiveInt, SecretStr in Pydantic models

Files:

  • src/configuration.py
  • src/models/config.py
tests/{unit,integration}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/{unit,integration}/**/*.py: Use pytest for all unit and integration tests
Do not use unittest; pytest is the standard for this project
Use pytest-mock for AsyncMock objects in tests
Use marker pytest.mark.asyncio for async tests
Unit tests require 60% coverage, integration tests 10%

Files:

  • tests/unit/models/config/test_reranker_configuration.py
  • tests/unit/utils/test_vector_search.py
🧠 Learnings (9)
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to **/*.py : Use Final[type] as type hint for all constants

Applied to files:

  • src/constants.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to **/constants.py : Central `constants.py` for shared constants with descriptive comments

Applied to files:

  • src/constants.py
📚 Learning: 2026-03-17T11:34:53.242Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1335
File: docs/config.md:411-411
Timestamp: 2026-03-17T11:34:53.242Z
Learning: In the lightspeed-stack project (`src/models/config.py`, `docs/config.md`), the internal Solr filter `is_chunk:true` (defined as `SOLR_CHUNK_FILTER_QUERY` in `src/constants.py`) is always injected by the system for OKP searches and is intentionally hidden from users. The `chunk_filter_query` field in `OkpConfiguration` is user-facing and additive-only, but the documentation must NOT mention the internal `is_chunk:true` behavior — this is a deliberate design decision by the maintainers.

Applied to files:

  • src/constants.py
  • src/models/config.py
  • tests/unit/utils/test_vector_search.py
  • src/utils/vector_search.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Always check `pyproject.toml` for existing dependencies before adding new ones

Applied to files:

  • pyproject.toml
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/config*.py : All config uses Pydantic models extending `ConfigurationBase`

Applied to files:

  • tests/unit/models/config/test_reranker_configuration.py
  • src/models/config.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/*.py : Pydantic models extend `ConfigurationBase` for config, `BaseModel` for data models

Applied to files:

  • src/models/config.py
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.

Applied to files:

  • src/models/config.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.

Applied to files:

  • src/models/config.py
📚 Learning: 2026-01-30T13:33:40.479Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1073
File: src/app/endpoints/query_v2.py:575-586
Timestamp: 2026-01-30T13:33:40.479Z
Learning: In `src/app/endpoints/query_v2.py`, the `parse_referenced_documents_from_responses_api` function filters referenced documents to include only those with `doc_title` or `doc_url` because these documents are displayed in the frontend. Documents with only `doc_id` are intentionally excluded as they wouldn't provide useful information to end users.

Applied to files:

  • src/utils/vector_search.py

Comment on lines +182 to +205
```python
    for chunk in rag_chunks:
        attrs = chunk.attributes or {}
        doc_url = (
            attrs.get("reference_url") or attrs.get("doc_url") or attrs.get("docs_url")
        )
        doc_id = attrs.get("document_id") or attrs.get("doc_id")
        dedup_key = doc_url or doc_id or chunk.source or ""
        if not dedup_key or dedup_key in seen:
            continue
        seen.add(dedup_key)
        parsed_url: Optional[AnyUrl] = None
        if doc_url:
            try:
                parsed_url = AnyUrl(doc_url)
            except Exception:  # pylint: disable=broad-exception-caught
                parsed_url = None
        result.append(
            ReferencedDocument(
                doc_title=attrs.get("title"),
                doc_url=parsed_url,
                source=chunk.source,
            )
        )
    return result
```
Contributor

⚠️ Potential issue | 🟠 Major

Skip referenced documents that have neither title nor URL.

This helper now emits ReferencedDocument objects even when both doc_title and doc_url are missing. Those entries are intentionally filtered out elsewhere because the frontend cannot render them usefully, so this regresses the final referenced-doc list after reranking.

♻️ Suggested guard
     for chunk in rag_chunks:
         attrs = chunk.attributes or {}
         doc_url = (
             attrs.get("reference_url") or attrs.get("doc_url") or attrs.get("docs_url")
         )
+        title = attrs.get("title")
         doc_id = attrs.get("document_id") or attrs.get("doc_id")
         dedup_key = doc_url or doc_id or chunk.source or ""
         if not dedup_key or dedup_key in seen:
             continue
         seen.add(dedup_key)
@@
-        result.append(
-            ReferencedDocument(
-                doc_title=attrs.get("title"),
-                doc_url=parsed_url,
-                source=chunk.source,
-            )
-        )
+        if title is None and parsed_url is None:
+            continue
+        result.append(
+            ReferencedDocument(
+                doc_title=title,
+                doc_url=parsed_url,
+                source=chunk.source,
+            )
+        )

Based on learnings: In src/app/endpoints/query_v2.py, the parse_referenced_documents_from_responses_api function filters referenced documents to include only those with doc_title or doc_url because these documents are displayed in the frontend.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 182 - 205, The loop that builds
ReferencedDocument from rag_chunks currently appends entries even when both
title and URL are missing; update the logic in the loop (the section that
computes attrs, doc_url, parsed_url and before result.append) to skip/continue
when both attrs.get("title") is falsy and parsed_url is falsy/None so you only
append ReferencedDocument when there is a doc_title or a doc_url; keep existing
dedup_key/seen handling and AnyUrl parsing but add a guard (using the computed
title and parsed_url) to avoid creating empty ReferencedDocument objects.

Signed-off-by: Anxhela Coba <acoba@redhat.com>
Contributor

@are-ces are-ces left a comment

The example config in the PR description must be outdated because

top_k_multiplier: 2.0 # Fetch 2x chunks, rerank, keep top_k
byok_boost: 1.2 # Boost factor for BYOK sources
okp_boost: 1.0 # Boost factor for OKP sources

are missing from the configuration in the PR.

Would be nice to see a configured reranker in the example configuration examples/lightspeed-stack-byok-okp-rag.yaml.

Please update the unit and integration tests, and fix the failing Black and linting checks. TY!
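A configured reranker section in examples/lightspeed-stack-byok-okp-rag.yaml could look like the sketch below, assembled from the fields shown in the PR description; the exact key names and nesting are assumptions until the final schema lands:

```yaml
reranker:
  enabled: true
  model: "cross-encoder/ms-marco-MiniLM-L6-v2"
  top_k_multiplier: 2.0  # fetch 2x chunks, rerank, keep top_k
  byok_boost: 1.2        # boost factor for BYOK sources
  okp_boost: 1.0         # boost factor for OKP sources
```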


# Create query-chunk pairs for scoring
pairs = [(query, chunk.content) for chunk in chunks]
scores = model.predict(pairs)
Contributor

As CodeRabbit points out, this call is synchronous and blocks the event loop; please make it async.
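One way to do that, sketched under the assumption that `model` is any CrossEncoder-like object with a synchronous `predict(pairs)` method, is to offload the blocking call to a worker thread with `asyncio.to_thread` (Python 3.9+) so the event loop stays responsive:

```python
import asyncio

async def rerank_chunks(model, query, chunks):
    # Create query-chunk pairs for scoring, as in the original snippet
    pairs = [(query, chunk.content) for chunk in chunks]
    # Run the synchronous predict() in a worker thread, off the event loop
    scores = await asyncio.to_thread(model.predict, pairs)
    # Highest relevance score first
    return sorted(zip(chunks, scores), key=lambda cs: cs[1], reverse=True)
```

This keeps the scoring code unchanged while making the coroutine awaitable alongside the other RAG fetch tasks.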

solr_chunks_task = _fetch_solr_rag(client, query, solr)

-    (byok_chunks, byok_docs), (solr_chunks, solr_docs) = await asyncio.gather(
+    (byok_chunks, _), (solr_chunks, _) = await asyncio.gather(
Contributor

The new _referenced_documents_from_rag_chunks (line 169) duplicates the same extraction logic already present in _process_byok_rag_chunks_for_documents and _process_solr_chunks_for_documents.

On top of that, those two per-source functions are effectively dead code in the current flow, their return values are discarded on line 754:

(byok_chunks, _), (solr_chunks, _) = await asyncio.gather(
    byok_chunks_task, solr_chunks_task
)

...and then _referenced_documents_from_rag_chunks re-extracts the same documents from the final chunks at line 785.

Suggestion: Instead of discarding the per-source documents and re-extracting them, keep the documents returned by _fetch_byok_rag and _fetch_solr_rag, then after reranking, filter/reorder them to match whichever chunks survived the rerank. This avoids duplicating the extraction logic and is more efficient — one lookup pass instead of a second extraction. _referenced_documents_from_rag_chunks can then be removed.
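The filter/reorder step suggested above could look like the following sketch; the `source` attribute as a join key between chunks and documents is an assumption based on the PR excerpts, and the dataclasses are stand-ins for the project's real models:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str
    content: str

@dataclass
class Doc:
    source: str
    title: str

def documents_for_surviving_chunks(reranked_chunks, documents):
    # One lookup pass: map each surviving source to its rerank position
    order = {chunk.source: i for i, chunk in enumerate(reranked_chunks)}
    # Keep only documents whose chunks survived, in rerank order
    kept = [doc for doc in documents if doc.source in order]
    kept.sort(key=lambda doc: order[doc.source])
    return kept
```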

Comment thread src/models/config.py
# pylint: disable=no-member
has_okp = (
    constants.OKP_RAG_ID in self.rag.inline
    or constants.OKP_RAG_ID in self.rag.tool
Contributor

Why should the reranker be active if OKP is in RAG as a Tool?

Contributor Author

@Anxhela21 Anxhela21 Apr 29, 2026

This checks whether OKP is enabled either inline or as a tool, since we can configure OKP either way, unless I'm mistaken?

Comment thread src/models/config.py
Self: The validated configuration instance with reranker potentially enabled.
"""
# Check if BYOK RAG entries are configured
has_byok = len(self.byok_rag) > 0
Contributor

Should also check that there are any entries registered in inline RAG

Contributor Author

Oh good point!
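The amended auto-enable check could be sketched like this; the standalone function, parameter names, and the "ocp-docs" default ID are illustrative assumptions, not the PR's actual code:

```python
def should_auto_enable_reranker(byok_rag, inline, tool, okp_rag_id="ocp-docs"):
    # Non-OKP sources: explicit BYOK entries or any other inline RAG registration
    has_sources = len(byok_rag) > 0 or any(r != okp_rag_id for r in inline)
    # OKP registered either inline or as a tool
    has_okp = okp_rag_id in inline or okp_rag_id in tool
    # Reranking only pays off when both kinds of sources are present
    return has_sources and has_okp
```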

Returns:
    RAGContext containing formatted context text and referenced documents
"""
if moderation_decision == "blocked":
Contributor

Why was this removed? This if was there to prevent doing RAG in case of an unsafe input

client: AsyncLlamaStackClient,
query: str,
solr: Optional[SolrVectorSearchRequest] = None,
max_chunks: Optional[int] = None,
Contributor

This parameter is not used

byok_chunks_task = _fetch_byok_rag(
    client, query, vector_store_ids, max_chunks=pool_size
)
solr_chunks_task = _fetch_solr_rag(client, query, solr)
Contributor

Why are we not increasing the pool size for OKP chunks? I might be missing something here...

"""
if model_name not in _cross_encoder_models:
    try:
        from sentence_transformers import (  # pylint: disable=import-outside-toplevel
Contributor

This import should be moved to the top of the module.
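With the heavyweight import hoisted to module top, the per-model cache can stay very small. A sketch, with the loader injected so the pattern is testable (in the PR it would be `sentence_transformers.CrossEncoder`, and the cache name mirrors the excerpt's `_cross_encoder_models`):

```python
from typing import Callable, Dict

_model_cache: Dict[str, object] = {}

def get_cached_model(model_name: str, loader: Callable[[str], object]) -> object:
    # Load each cross-encoder once and reuse it across requests
    if model_name not in _model_cache:
        _model_cache[model_name] = loader(model_name)
    return _model_cache[model_name]
```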

Comment thread pyproject.toml
# To be able to fix multiple CVEs, also LCORE-1117
"requests>=2.33.0",
# Used for RAG chunk reranking (cross-encoder)
"sentence-transformers>=2.0.0",
Contributor

Should the sentence-transformer dependency be added @syedriko @tisnik? In that case uv.lock will need to be updated too

Contributor

yes, it needs to be updated to pass CI etc.

Contributor Author

thanks missed that - will update uv.lock

Comment thread src/models/config.py
class RerankerConfiguration(ConfigurationBase):
    """Reranker configuration for RAG chunk reranking."""

    enabled: bool = True
Contributor

nit: can be initialized as Field so it would be possible to specify title and description.
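A sketch of that nit, assuming pydantic v2; `BaseModel` stands in for the project's `ConfigurationBase`, and the title/description strings are illustrative:

```python
from pydantic import BaseModel, Field

class RerankerConfiguration(BaseModel):
    """Reranker configuration for RAG chunk reranking."""

    enabled: bool = Field(
        default=True,
        title="Enable reranker",
        description="Rerank merged RAG chunks with a cross-encoder model.",
    )
```

Declaring the flag via `Field` surfaces the title and description in the generated JSON schema and docs.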
